Lexical Profiling of Existing Web Directories to Support Fine-grained Topic-Focused Web Crawling
Abstract
Topic-focused Web crawling aims to harness the potential of the Internet reliably and efficiently, producing topic-specific indexes of pages within the Web. Previous work has focused on supplying suitably general descriptions of topics to generate large general indexes. In this paper we propose a method that uses lexical profiling of a corpus consisting of the hierarchical structures in existing Web directories to specify finer-grained topics from smaller sets of training examples, while using the seemingly redundant information in related topics to make the process of gathering pages more efficient. We also suggest a link scoring formula that combines content, context and page lexical similarities to a given topic to prioritise the links for crawling. Initial experiments with the Open Directory Project show that the prioritised crawl provides significantly more pages than the breadth-first crawler, and that the rate at which the number of relevant pages increases is much higher. Keeping the crawler close to the target subject, by following the links most likely to lead to target pages, reduces "unproductive" periods.
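The abstract does not give the link scoring formula itself, so the following is only a minimal sketch of how such a prioritised crawl might be organised. It assumes that "content" is the link's anchor text, "context" is the text surrounding the link, and "page" is the text of the page holding the link; that each is compared to the topic profile by cosine similarity over term frequencies; and that the three similarities are combined by a weighted sum. The names `profile`, `cosine`, `link_score` and `Frontier`, and the weights, are hypothetical and not taken from the paper.

```python
import heapq
import math
import re
from collections import Counter


def profile(text):
    """Build a term-frequency profile (bag of words) from raw text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency profiles."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_score(topic, anchor_text, context_text, page_text,
               weights=(0.4, 0.3, 0.3)):
    """Weighted sum of three lexical similarities to the topic profile.

    The weights are illustrative assumptions, not values from the paper.
    """
    w_content, w_context, w_page = weights
    return (w_content * cosine(topic, profile(anchor_text))
            + w_context * cosine(topic, profile(context_text))
            + w_page * cosine(topic, profile(page_text)))


class Frontier:
    """Crawl frontier that always yields the highest-scoring unseen link."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Ignore URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A breadth-first crawler corresponds to replacing `Frontier` with a plain FIFO queue; the priority queue is what lets the crawl stay close to the target topic by visiting the most promising links first.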
Similar resources
Analyzing Fine-grained Hypertext Features for Enhanced Crawling and Topic Distillation
Early Web search engines closely resembled Information Retrieval (IR) systems which had matured over several decades. Around 1996–1999, it became clear that the spontaneous formation of hyperlink communities in the Web graph had much to offer to Web search, leading to a flurry of research on hyperlink-based ranking of query responses. In this paper we show that, over and above inter-page hyperl...
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed, among them a key technique is focused crawling which is able to crawl particular topical...
Focused Crawling Techniques
The need for increasingly specific replies to web search queries has prompted researchers to work on focused web crawling techniques for web spiders. A variety of lexical and link-based approaches to focused web crawling are introduced in the paper, highlighting important aspects of each. General Terms: Focused Web Crawling, Algorithms, Crawling Techniques.
Evolving Strategies for Focused Web Crawling
The rapid growth of the World Wide Web has created many challenges for both general purpose crawling, search engines and web directories, making it difficult to find, index, and classify web pages based on a topic. Topic driven crawlers can complement search engines because they pre-classify the pages retrieved by the crawl. To implement such a focused crawler, a strategy for ordering the crawl...
Ontology-Focused Crawling of Web Documents and RDF-based Metadata
The enormous growth of the World Wide Web in recent years has made it important to develop document discovery mechanisms based on intelligent and focused crawling techniques. The next-generation Web, the Semantic Web, that is currently being developed as a meta Web, building on the existing one, changes the classical crawling task. Metadata that is based on ontologies will exist in the form of ...
Publication date: 2009